The datasets used in the analysis below were sourced from Kaggle (www.kaggle.com) 1. They were compiled from several sources, including the Bureau of Justice Statistics 2 and the FBI Uniform Crime Reporting (UCR) Program 3. The National Prisoner Statistics Program, conducted by the Bureau of Justice Statistics, has collected data on the number of prisoners in state and federal prison facilities since 1926. The statistics are produced annually at national and state level. Data are sourced from the 50 state departments of correction, the Federal Bureau of Prisons and, until 2001, the District of Columbia. The UCR Program provides statistics on violent and property crimes. Data are collected annually and are available at national, state and city level. For the purposes of our analysis we use state-level statistics.
Additionally, we collected data on each state's prison expenditures in 2016, the latest year available, from the Bureau of Justice Statistics 4. Later in the analysis we use them to correlate the expenditures with the occurrence of particular crimes.
The UCR dataset consists of 15 variables, two of which are the jurisdiction and the year of the observation. It provides the state population and the yearly number of violent crimes (murder, manslaughter, rape, robbery, aggravated assault) and property crimes (burglary, larceny, vehicle theft) per state. Detailed definitions of each crime can be found on the UCR Program website.
The crime_reporting_change variable reflects instances when a state's reporting standards changed. The crimes_estimated variable indicates cases where the FBI computed estimates for a state because participating agencies did not provide 12 months of complete data 5.
ucr <- read_csv("data/ucr_by_state.csv")
colnames(ucr)
## [1] "jurisdiction" "year"
## [3] "crime_reporting_change" "crimes_estimated"
## [5] "state_population" "violent_crime_total"
## [7] "murder_manslaughter" "rape_legacy"
## [9] "rape_revised" "robbery"
## [11] "agg_assault" "property_crime_total"
## [13] "burglary" "larceny"
## [15] "vehicle_theft" "X16"
## [17] "X17" "X18"
## [19] "X19" "X20"
## [21] "X21"
The ucr dataset has many missing values, whereas the other datasets have none. We dropped the last 6 columns, which were completely empty, and then dropped the rows consisting only of missing values. This leaves no missing values in any column apart from “rape_revised” (612 missing values) and “rape_legacy” (104 missing values).
# removing last 6 columns
ucr <- ucr[, -c(16:21)]
# removing all missing rows
ind <- apply(ucr, 1, function(x) all(is.na(x)))
ucr <- ucr[ !ind, ]
# showing sum of missing values per columns
sapply(ucr, function(x) sum(is.na(x)))
## jurisdiction year crime_reporting_change
## 0 0 0
## crimes_estimated state_population violent_crime_total
## 0 0 0
## murder_manslaughter rape_legacy rape_revised
## 0 104 612
## robbery agg_assault property_crime_total
## 0 0 0
## burglary larceny vehicle_theft
## 0 0 0
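The cleaning step above drops columns by position (16 to 21). As a minimal sketch, assuming a made-up toy data frame in place of the raw import, the same result can be obtained without hard-coding positions by keeping only the columns and rows that contain at least one non-missing value. The names toy, toy_clean and na_per_column are illustrative and are not part of the original analysis.

```r
# Toy data frame standing in for the raw ucr import (all values are made up).
toy <- data.frame(
  jurisdiction = c("Alabama", "Alaska", NA),
  year         = c(2001, 2001, NA),
  burglary     = c(40000, 4000, NA),
  X4           = c(NA, NA, NA),      # completely empty column, like X16 to X21
  stringsAsFactors = FALSE
)

# Keep only columns with at least one non-missing value.
toy_clean <- toy[, colSums(!is.na(toy)) > 0, drop = FALSE]

# Keep only rows with at least one non-missing value.
toy_clean <- toy_clean[rowSums(!is.na(toy_clean)) > 0, , drop = FALSE]

# Count the remaining missing values per column.
na_per_column <- colSums(is.na(toy_clean))
print(na_per_column)
```

Selecting by content rather than by index keeps the step valid if the number of empty trailing columns in the CSV file changes.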
As the plot on the left below shows, the last two years, 2016 and 2017, contain one additional observation, i.e. an additional jurisdiction. The plot on the right shows that New York is missing in one year and that Puerto Rico appears in only 3 years. District of Columbia is sometimes recorded as DC, but the two names together cover all 17 years.
library(viridis)
## Loading required package: viridisLite
plot.data1 = ucr %>% group_by(year) %>% count()
ggp1 = ggplot(data = plot.data1, aes(x=year, y=n, fill=year)) +
geom_bar(stat = "identity") +
scale_fill_viridis() +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none")
plot.data2 = ucr %>% group_by(jurisdiction) %>% count() %>% arrange(n) %>% filter(n<17)
ggp2 = ggplot(data = plot.data2, aes(x=jurisdiction, y=n, fill=jurisdiction)) +
geom_bar(stat = "identity") +
theme_minimal() +
scale_fill_viridis_d() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none")
grid.arrange(ggp1, ggp2, ncol = 2)
Based on the above analysis, we decided to rename “DC” to “District of Columbia” and to exclude Puerto Rico.
ucr$jurisdiction[ucr$jurisdiction=="DC"] <- "District of Columbia"
ucr <- ucr %>% filter(jurisdiction!="Puerto Rico")
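A quick way to confirm that the renaming worked is to count the observations per jurisdiction again and list those with fewer rows than there are years. The sketch below does this in base R on a made-up toy table with three years; the names toy, obs_per_jurisdiction and incomplete are illustrative only.

```r
# Toy stand-in for ucr: three years, with "DC" used in one of them.
toy <- data.frame(
  jurisdiction = c("Alabama", "Alabama", "Alabama",
                   "DC", "District of Columbia", "District of Columbia",
                   "Puerto Rico"),
  year = c(2015, 2016, 2017, 2015, 2016, 2017, 2017),
  stringsAsFactors = FALSE
)

# Same two cleaning steps as above, written in base R.
toy$jurisdiction[toy$jurisdiction == "DC"] <- "District of Columbia"
toy <- toy[toy$jurisdiction != "Puerto Rico", ]

# Jurisdictions that do not appear in every year.
obs_per_jurisdiction <- table(toy$jurisdiction)
n_years <- length(unique(toy$year))
incomplete <- names(obs_per_jurisdiction)[obs_per_jurisdiction < n_years]
print(incomplete)
```

On the real data this check would still be expected to list New York, which is missing in one year.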
We also analysed the missing values of the variables rape_revised and rape_legacy. Because so many values are missing and the two variables are mostly available in different years, we cannot compare them, which is why we decided to drop both.
rape_df <- data.frame(year=2001:2017)
rape_revised_count <- ucr[!is.na(ucr$rape_revised),] %>%
group_by(year) %>%
count(name="rape_revised_count")
rape_legacy_count <- ucr[!is.na(ucr$rape_legacy),] %>%
group_by(year) %>%
count(name="rape_legacy_count")
rape_df <- left_join(rape_df, rape_revised_count, by="year")
rape_df <- left_join(rape_df, rape_legacy_count, by="year")
kable(rape_df)
| year | rape_revised_count | rape_legacy_count |
|---|---|---|
| 2001 | NA | 51 |
| 2002 | NA | 51 |
| 2003 | NA | 51 |
| 2004 | NA | 51 |
| 2005 | NA | 51 |
| 2006 | NA | 51 |
| 2007 | NA | 51 |
| 2008 | NA | 51 |
| 2009 | NA | 51 |
| 2010 | NA | 51 |
| 2011 | NA | 51 |
| 2012 | NA | 51 |
| 2013 | 51 | 51 |
| 2014 | 51 | 51 |
| 2015 | 50 | 50 |
| 2016 | 51 | NA |
| 2017 | 51 | NA |
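The coverage table can also be computed in a single pass. The following is a minimal base R sketch on made-up toy values: it counts the non-missing entries of both variables per year with rowsum() and then extracts the years in which both variables are available. Unlike the joined table above, years without data show 0 instead of NA. The names toy, coverage and overlap_years are illustrative.

```r
# Toy stand-in for ucr with two jurisdictions per year (values are made up).
toy <- data.frame(
  year         = rep(c(2012, 2013, 2016), each = 2),
  rape_legacy  = c(10, 12, 11, 13, NA, NA),
  rape_revised = c(NA, NA, 15, 17, 16, 18)
)

# 1 where a value is present, 0 where it is missing.
has_value <- !is.na(toy[, c("rape_legacy", "rape_revised")]) * 1

# Sum the indicators within each year: one row per year, one column per variable.
coverage <- rowsum(has_value, group = toy$year)
print(coverage)

# Years in which both definitions are reported.
overlap_years <- rownames(coverage)[coverage[, "rape_legacy"] > 0 &
                                    coverage[, "rape_revised"] > 0]
print(overlap_years)
```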
ucr$rape_legacy <- NULL
ucr$rape_revised <- NULL
colnames(ucr)
## [1] "jurisdiction" "year"
## [3] "crime_reporting_change" "crimes_estimated"
## [5] "state_population" "violent_crime_total"
## [7] "murder_manslaughter" "robbery"
## [9] "agg_assault" "property_crime_total"
## [11] "burglary" "larceny"
## [13] "vehicle_theft"
# density plot for each crime count column (columns 6 to 13 of ucr)
crime_cols <- colnames(ucr)[6:13]
colors <- viridis(length(crime_cols))
pl <- lapply(seq_along(crime_cols), function(ii) {
  # .data[[...]] replaces the deprecated aes_string()
  ggplot(ucr, aes(x = .data[[crime_cols[ii]]])) +
    geom_density(alpha = 0.3, fill = colors[ii], color = colors[ii]) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    labs(title = crime_cols[ii])
})
grid.arrange(grobs = pl)
(…)
prison <- read_csv("data/prison_custody_by_state.csv")
head(prison)
## # A tibble: 6 x 18
## jurisdiction includes_jails `2001` `2002` `2003` `2004` `2005` `2006`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Federal 0 149852 158216 168144 177600 186364 190844
## 2 Alabama 0 24741 25100 27614 25635 24315 24103
## 3 Alaska 1 4570 4351 4472 4534 4798 5052
## 4 Arizona 0 27710 29359 31084 32384 33345 35752
## 5 Arkansas 0 11489 11849 12068 12577 12455 12854
## 6 California 0 157142 159695 161785 163939 168035 172298
## # ... with 10 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>
Unlike ucr, the prison data are in wide format, with one column per year. Using long_panel (from the panelr package) we converted the data frame to long format, so that each row corresponds to one jurisdiction and year.
colnames(prison)[3:18] <- paste0(colnames(prison)[3:18],'1')
prison_panel <- long_panel(prison, begin = 2001, end = 2016, label_location = "beginning", id = "jurisdiction")
names(prison_panel)[names(prison_panel) == "wave"] <- "year"
names(prison_panel)[names(prison_panel) == "1"] <- "prison"
kable(head(prison_panel))
| jurisdiction | year | includes_jails | prison |
|---|---|---|---|
| Alabama | 2001 | 0 | 24741 |
| Alabama | 2002 | 0 | 25100 |
| Alabama | 2003 | 0 | 27614 |
| Alabama | 2004 | 0 | 25635 |
| Alabama | 2005 | 0 | 24315 |
| Alabama | 2006 | 0 | 24103 |
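For readers without long_panel, the wide-to-long conversion can be sketched in base R. The example below uses two rows and two year columns copied from the head(prison) output above; the names wide, year_cols and long are illustrative, and this is an alternative sketch rather than the conversion actually used in the analysis.

```r
# Two jurisdictions and two years taken from the head(prison) output above.
wide <- data.frame(
  jurisdiction   = c("Alabama", "Alaska"),
  includes_jails = c(0, 1),
  `2001`         = c(24741, 4570),
  `2002`         = c(25100, 4351),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column that is not an identifier is treated as a year.
year_cols <- setdiff(names(wide), c("jurisdiction", "includes_jails"))

# Stack the year columns on top of each other and repeat the identifiers.
long <- data.frame(
  jurisdiction   = rep(wide$jurisdiction, times = length(year_cols)),
  year           = rep(as.integer(year_cols), each = nrow(wide)),
  includes_jails = rep(wide$includes_jails, times = length(year_cols)),
  prison         = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sort so that each jurisdiction's years are adjacent.
long <- long[order(long$jurisdiction, long$year), ]
rownames(long) <- NULL
print(long)
```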
prison_exp_2016 <- read_delim("data/prison_expenditures.csv", ";")
head(prison_exp_2016)
## # A tibble: 6 x 2
## `State and type of government` `prison expenditure`
## <chr> <dbl>
## 1 Alabama 722269
## 2 Alaska 338005
## 3 Arizona 1684710
## 4 Arkansas 595731
## 5 California 15468283
## 6 Colorado 1313103
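As a rough illustration of how the expenditures could later be related to crime figures, the sketch below computes Pearson and Spearman correlations for the six states shown above. The expenditure values are copied from the head(prison_exp_2016) output, and the crime values are the multi-year averages of violent_crime_total printed in the ucr_grouped summary of this report, not 2016 figures, so the result is illustrative only. The names toy, pearson and spearman are assumptions made for this example.

```r
# Six states: expenditures from prison_exp_2016, crime averages from ucr_grouped.
toy <- data.frame(
  jurisdiction        = c("Alabama", "Alaska", "Arizona",
                          "Arkansas", "California", "Colorado"),
  prison_expenditure  = c(722269, 338005, 1684710, 595731, 15468283, 1313103),
  violent_crime_total = c(20979, 4581, 29705, 14376, 180014, 17139),
  stringsAsFactors = FALSE
)

# Pearson correlation is sensitive to outliers such as California.
pearson <- cor(toy$prison_expenditure, toy$violent_crime_total)

# Spearman correlation uses ranks and is less affected by a single large state.
spearman <- cor(toy$prison_expenditure, toy$violent_crime_total,
                method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Both totals grow with state population, so a high correlation of raw totals is expected; dividing both variables by population before correlating would probably be more informative.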
To enrich further visualisations, we add information about each state's area and region, based on the us_states dataset from the spData package.
library(spData)
library(sf)
us_states_info <- data.frame(jurisdiction = us_states$NAME,
region = us_states$REGION,
area_km2 = as.numeric(round(us_states$AREA, 0)))
us_states_info
## jurisdiction region area_km2
## 1 Alabama South 133709
## 2 Arizona West 295281
## 3 Colorado West 269573
## 4 Connecticut Norteast 12977
## 5 Florida South 151052
## 6 Georgia South 152725
## 7 Idaho West 216513
## 8 Indiana Midwest 93648
## 9 Kansas Midwest 213037
## 10 Louisiana South 122346
## 11 Massachusetts Norteast 20911
## 12 Minnesota Midwest 218566
## 13 Missouri Midwest 180716
## 14 Montana West 380829
## 15 Nevada West 286364
## 16 New Jersey Norteast 20274
## 17 New York Norteast 127202
## 18 North Dakota Midwest 183178
## 19 Oklahoma South 180971
## 20 Pennsylvania Norteast 117242
## 21 South Carolina South 80904
## 22 South Dakota Midwest 199767
## 23 Texas South 687714
## 24 Vermont Norteast 24866
## 25 West Virginia South 62813
## 26 Arkansas South 137690
## 27 California West 409747
## 28 Delaware South 5182
## 29 District of Columbia South 178
## 30 Illinois Midwest 145993
## 31 Iowa Midwest 145744
## 32 Kentucky South 104458
## 33 Maine Norteast 85520
## 34 Maryland South 26849
## 35 Michigan Midwest 151119
## 36 Mississippi South 123745
## 37 Nebraska Midwest 200272
## 38 New Hampshire Norteast 24026
## 39 New Mexico West 314886
## 40 North Carolina South 129233
## 41 Ohio Midwest 107051
## 42 Oregon West 251346
## 43 Rhode Island Norteast 2743
## 44 Tennessee South 109114
## 45 Utah West 219860
## 46 Virginia South 105405
## 47 Washington West 175436
## 48 Wisconsin Midwest 144954
## 49 Wyoming West 253310
Because two states are missing from the us_states_info dataset, we manually added the region and land area for Hawaii and Alaska 6.
additional_states <- data.frame(jurisdiction = c("Hawaii", "Alaska"),
region = c("remote", "remote"),
area_km2 = c(16638, 1481346 ))
us_states_info <- rbind(us_states_info, additional_states)
us_states_info
## jurisdiction region area_km2
## 1 Alabama South 133709
## 2 Arizona West 295281
## 3 Colorado West 269573
## 4 Connecticut Norteast 12977
## 5 Florida South 151052
## 6 Georgia South 152725
## 7 Idaho West 216513
## 8 Indiana Midwest 93648
## 9 Kansas Midwest 213037
## 10 Louisiana South 122346
## 11 Massachusetts Norteast 20911
## 12 Minnesota Midwest 218566
## 13 Missouri Midwest 180716
## 14 Montana West 380829
## 15 Nevada West 286364
## 16 New Jersey Norteast 20274
## 17 New York Norteast 127202
## 18 North Dakota Midwest 183178
## 19 Oklahoma South 180971
## 20 Pennsylvania Norteast 117242
## 21 South Carolina South 80904
## 22 South Dakota Midwest 199767
## 23 Texas South 687714
## 24 Vermont Norteast 24866
## 25 West Virginia South 62813
## 26 Arkansas South 137690
## 27 California West 409747
## 28 Delaware South 5182
## 29 District of Columbia South 178
## 30 Illinois Midwest 145993
## 31 Iowa Midwest 145744
## 32 Kentucky South 104458
## 33 Maine Norteast 85520
## 34 Maryland South 26849
## 35 Michigan Midwest 151119
## 36 Mississippi South 123745
## 37 Nebraska Midwest 200272
## 38 New Hampshire Norteast 24026
## 39 New Mexico West 314886
## 40 North Carolina South 129233
## 41 Ohio Midwest 107051
## 42 Oregon West 251346
## 43 Rhode Island Norteast 2743
## 44 Tennessee South 109114
## 45 Utah West 219860
## 46 Virginia South 105405
## 47 Washington West 175436
## 48 Wisconsin Midwest 144954
## 49 Wyoming West 253310
## 50 Hawaii remote 16638
## 51 Alaska remote 1481346
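In the prison dataset the only jurisdiction that does not appear in ucr is Federal, and in prison_exp_2016 it is Washington, D.C. To unify the names we renamed both to District of Columbia. Note that Federal presumably refers to prisoners held by the Federal Bureau of Prisons rather than by the District itself, so treating it as District of Columbia is a simplification. We also renamed the variable State and type of government to jurisdiction to simplify later calculations.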
setdiff(prison$jurisdiction %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Federal"
setdiff(prison_exp_2016$`State and type of government` %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Washington, D.C."
setdiff(ucr$jurisdiction %>% unique(), us_states_info$jurisdiction %>% unique())
## character(0)
prison$jurisdiction[prison$jurisdiction=="Federal"] <- "District of Columbia"
names(prison_exp_2016)[names(prison_exp_2016) == "State and type of government"] <- "jurisdiction"
prison_exp_2016$jurisdiction[prison_exp_2016$jurisdiction=="Washington, D.C."] <- "District of Columbia"
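When several tables need the same renaming, the individual assignments can be replaced by a single lookup vector. This is a minimal sketch under the assumption that all variants are known in advance; harmonise() and name_map are hypothetical helpers introduced here, not functions from any package. The map lists only the two spelling variants of District of Columbia, and further entries can be added the same way.

```r
# Lookup from variant spelling to the canonical jurisdiction name.
name_map <- c("DC"               = "District of Columbia",
              "Washington, D.C." = "District of Columbia")

# Replace every name found in the map and leave all other names unchanged.
harmonise <- function(x, map) {
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

# Toy example: reference names and a raw column with two variants.
reference <- c("Alabama", "District of Columbia")
raw       <- c("Alabama", "Washington, D.C.", "DC")

clean <- harmonise(raw, name_map)

# Same check as above: names that still have no match in the reference.
unmatched <- setdiff(unique(clean), reference)
print(clean)
print(unmatched)
```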
According to recent surveys of United States expenditures, spending on incarceration has increased about three times as fast as spending on elementary and secondary education during this time period. (…)
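The maps below show, for each US state, the average yearly number of property crimes and violent crimes relative to the state population (total_pop_15).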
ucr_grouped <- ucr %>%
group_by(jurisdiction) %>%
summarise(
violent_crime_total = mean(violent_crime_total),
property_crime_total = mean(property_crime_total))
names(ucr_grouped)[names(ucr_grouped) == "jurisdiction"] <- "NAME"
ucr_grouped
## # A tibble: 51 x 3
## NAME violent_crime_total property_crime_total
## <chr> <dbl> <dbl>
## 1 Alabama 20979. 169904.
## 2 Alaska 4581. 22255.
## 3 Arizona 29705. 255059.
## 4 Arkansas 14376. 104902.
## 5 California 180014. 1076578.
## 6 Colorado 17139. 153682.
## 7 Connecticut 9864. 81138
## 8 Delaware 5212. 28289.
## 9 District of Columbia 8269. 30596.
## 10 Florida 111554. 682763.
## # ... with 41 more rows
us_states_ucr <- merge(us_states, ucr_grouped, by = "NAME")
us_states_ucr$violent_crime_per_pop <- us_states_ucr$violent_crime_total/us_states_ucr$total_pop_15
us_states_ucr$property_crime_per_pop <- us_states_ucr$property_crime_total/us_states_ucr$total_pop_15
ggp1 <- ggplot(data = us_states_ucr) +
geom_sf(aes(fill = property_crime_per_pop)) +
scale_fill_viridis_c(option = "viridis", trans = "sqrt")
ggp2 <- ggplot(data = us_states_ucr) +
geom_sf(aes(fill = violent_crime_per_pop)) +
scale_fill_viridis_c(option = "viridis", trans = "sqrt")
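The per-population values computed above are small fractions, which are hard to read in a map legend. A common convention is to express them per 100,000 inhabitants. The sketch below shows the arithmetic in base R: the crime averages are taken from the ucr_grouped output above, while the population figures are hypothetical round numbers chosen only for illustration. Note also that merge() keeps only names present in both tables by default, so jurisdictions absent from us_states do not appear on the maps.

```r
# Average yearly violent crimes, copied from the ucr_grouped output above.
crime <- data.frame(
  NAME = c("Alabama", "Arizona"),
  violent_crime_total = c(20979, 29705),
  stringsAsFactors = FALSE
)

# Hypothetical round population figures, NOT the real total_pop_15 values.
pop <- data.frame(
  NAME = c("Alabama", "Arizona"),
  total_pop_15 = c(4800000, 6600000),
  stringsAsFactors = FALSE
)

# Inner join on the state name, as merge() does by default.
merged <- merge(crime, pop, by = "NAME")

# Crimes per 100,000 inhabitants.
merged$violent_per_100k <-
  merged$violent_crime_total / merged$total_pop_15 * 1e5
print(merged)
```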
What is the relationship between the number of prisoners (prison) and the occurrence of particular crimes over the years (ucr)? Does an increase in the number of prisoners reduce the rate of any type of crime? Or is there a steady increase or decrease in crime? (geom line and geom smooth)
Does this significant investment into imprisonment improve public safety? Prison expenditures versus the occurrence of crimes, in total and by category, in 2016 (latest data); source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286
How does the number of prisoners change over the years, for the whole country and for individual states?
prison_country <- prison[,c(3:18)]
prison_country <- sapply(prison_country, sum)
df <- stack(prison_country)
colnames(df) <- c("value", "year")
p <- ggplot(data = df, aes(x = year, y = value, group = 1,
text = paste("Year: ", year,
"<br>Number of prisoners:", value))) +
geom_line() +
geom_point() +
# scale_color_viridis() +
# scale_fill_viridis() +
labs(title = "Number of prisoners in the USA by year", x = "Year", y = "Number of prisoners") +
theme_minimal()
ggplotly(p, tooltip = "text")
additional variables -> area (ok) - in the code -> prison expenditures (ok) - in Excel, 2016
-> what besides the map and bubbles? - heatmap -
-> https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
Source: https://www.kaggle.com/christophercorrea/prisoners-and-crime-in-united-states#ucr_by_state.csv
Source: https://www.ucrdatatool.gov/Search/Crime/State/RunCrimeStatebyState.cfm
“For agencies supplying 3 to 11 months of data, the national UCR Program estimates for the missing data by following a standard estimation procedure using the data provided by the agency. If an agency has supplied less than 3 months of data, the FBI computes estimates by using the known crime figures of similar areas within a state and assigning the same proportion of crime volumes to nonreporting agencies.” (cited from https://www.ucrdatatool.gov/faq.cfm)
Sources: https://en.wikipedia.org/wiki/Alaska and https://en.wikipedia.org/wiki/Hawaii